MicroRNA target prediction using thermodynamic and sequence curves
نویسندگان
چکیده
Background: MicroRNAs (miRNAs) are small regulatory RNA that mediate RNA interference by binding to various mRNA target regions. There have been several computational methods for the identification of target mRNAs for miRNAs. However, these have considered all contributory features as scalar representations, primarily, as thermodynamic or sequence-based features. Further, a majority of these methods solely target canonical sites, which are sites with “seed” complementarity. Here, we present a machine-learning classification scheme, titled Avishkar, which captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves, separately for various input features, such as thermodynamic and sequence features. Further, we use a principled approach to uniformly model canonical and non-canonical seed matches, using a novel seed enrichment metric. Results: We demonstrate that large number of seed-match patterns have high enrichment values, conserved across species, and that majority of miRNA binding sites involve non-canonical matches, corroborating recent findings. Using spatial curves and popular categorical features, such as target site length and location, we train a linear SVM model, utilizing experimental CLIP-seq data. Our model significantly outperforms all established methods, for both canonical and non-canonical sites. We achieve this while using a much larger candidate miRNA-mRNA interaction set than prior work. Conclusions: We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data, demonstrating superior performance, evaluated using ROC curves, specifically about 20 % better than the state-of-the-art, for different species (human or mouse), or different target types (canonical or non-canonical). To the best of our knowledge we provide the first distributed framework for microRNA target prediction based on Apache Hadoop and Spark. Availability: All source code and data is publicly available at https://bitbucket.org/cellsandmachines/avishkar. Background MicroRNAs (miRNAs) are short 20–24 nucleotide (nt), endogenous RNAs that modulate gene regulatory pathways [1, 2] and form the most widely studied class of non-coding RNAs (ncRNAs). miRNAs mediate RNA interference (RNAi) by targeting the 3’ UTR of themRNA, or in some cases, other mRNA regions, such as the mRNA’s coding sequence (CDS) or its 5’ UTR [3]. Following their biogenesis, miRNAs complex with Argonaute (AGO) proteins, which are the catalytic components of the RNA-induced silencing complex (RISC) [4]. This *Correspondence: [email protected] 1Department of Computer Science, Purdue University, West Lafayette, IN, 47907, USA Full list of author information is available at the end of the article miRNA-RISC complex then targets its cognate mRNA fragment. These interactions result in mRNA repression, destabilization, or, in more complex ways, contour the gene expression landscape [5, 6]. There are over two thousand miRNAs that have been annotated in humans [7], displaying many-to-many associations with mRNA targets. Such associations are speculated to be controlling a vast majority of mammalian genes [8], involving all cellular pathways, from development to pluripotency to oncogenesis [9–14]. Notwithstanding the biological importance of miRNAs, determining their targets with high accuracy and exhaustively has remained elusive, with in-silico predictions plagued by high false-positive and false-negative rates [15]. This is due inmany ways to the small size of miRNAs, © 2016 Ghoshal et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ghoshal et al. BMC Genomics (2015) 16:999 Page 2 of 21 which requires as few as 6 base pairs of complementarity for functional miRNA targeting, as well as the diverse miRNA targetome [16]. As a machine learning task, the problem of miRNA target prediction is that of link prediction in a bipartite graph, where vertices in one set represent all possible target regions across all mRNAs while vertices in the other set represent miRNAs. We can either predict if an edge exists (1/0) between a pair of vertices representing an mRNA region and a miRNA (classification), or we can predict the strength of the association i.e., edge weights (regression). In this paper, we focus on the classification problem of whether a miRNA targets an mRNA region. CLIP-seq, crosslinking via immunoprecipitation followed by high-throughput sequencing, an elegant albeit lengthy biochemical procedure, is a state-of-the-artplayer in developing genome-scale regulatory insights [17–19]. The technology allows target mRNAs to be identified within a small window of resolution, beyond which, statistical models are needed to exactly localize the MRE, that is, the miRNA recognition element or the binding site. This is true even for recent CLIP-seq variants [19], in order to account for background noise and sequencing artifacts [20, 21]. Further, CLIP-seq has the advantage of profiling the native miRNA levels, as opposed to supra-physiological levels obtained via miRNA transfection experiments [22], the latter being better suited for developing small-interfering RNA (siRNA)-based therapeutics [23, 24]. While CLIP-seq can identify miRNAs and targets that form a part of the RISC complex, it cannot decipher which miRNA forms a heteroduplex with which targets. CLASH is an initial attempt in experimentally solving this problem [25]. Several computational methods have been developed to decipher the specifics of miRNA-mRNA interactions captured by CLIP-seq [26–31]. These methods have contributed to understanding the diverse nature of interactions between miRNA and mRNA. The evolving knowledgebase has further supported the paradigm switch, wherein it is now widely appreciated that the perfect complementarity between the miRNA seed and the mRNA 3’ UTR is neither necessary nor sufficient for miRNA regulation. Our contribution In this paper, we seek to leverage this ability of the CLIP-seq technology to capture endogenous MREs to develop a unified method to understand the signatures of miRNA-mRNA heteroduplexes. Our method applies equally to standard, canonical seed matches, and non-standard, non-canonical seedless matches1. Specifically, in our system, which we call Avishkar2, we use smooth B-spline, thermodynamic curves and sequence curves for adenosine-uracil (AU) content, in order to extract enriched interaction features from the experimentally CLIPed (i.e., immunoprecipitated) regions. Our main contributions through this work can be summarized as follows: 1. We develop an efficient Support Vector Machine (SVM)-based classifier to identify the positive miRNA-mRNA interactions. Our classifier produces significantly better ROC curves than all prior work [26–28, 32, 33] when evaluated on CLIP-seq data, while also providing insights on which features are discriminating, and in which direction, that is, positive or negative interactions. Our Area-Under-the-Curve (AUC) values for the ROC curves for both human and mouse datasets are greater than that of all prior works, quantitatively 19.7 % and 22.0 % better for human (seed and seedless respectively) and 15.0 % and 22.8 % better for mouse (seed and seedless respectively). The classification performance of our model in inter-species validations while being slightly worse compared to intra-species validations, is still able to beat all prior methods. Our improved performance (in terms of true-positive and false-positive rates) over all prior work arises from a combination of multiple factors, with the total benefit being greater than the sum of the constituents. The contributory factors are the use of an extensive set of features, converting noisy data points into smooth curves, converting the categorical feature of seed or seedless match into a numerical feature and treating both under one unified umbrella, and a careful consideration of the spatial nature of the miRNA-mRNA binding process into our classification scheme. Our candidate dataset of miRNA-mRNA interactions is the largest among other computational approaches, which we achieve by employing the least strict filtering criteria on the original CLIP-seq data. Finally, our method is able to predict significantly more non-canonical sites that are present within CLIPed regions than prior computational approaches. 2. We characterize thermodynamic and sequence scores as “curves” and demonstrate how the shape of the curves discriminates between positive and negative miRNA-mRNA interactions. We compute curves at two levels of granularity for each of the thermodynamic and sequence features—curves centered at the target site (we refer to them as “site curves”) and curves computed at a finer granularity and centered at the mRNA seed-matched region (we call them “seed curves”). We demonstrate that a sum of 20 basis-splines (B-splines), each of degree 3, gives us satisfactory curve-fitting. Our use of B-splines Ghoshal et al. BMC Genomics (2015) 16:999 Page 3 of 21 enables us to fit relatively smooth curves over high dimensional, noisy data—the scalar data points for thermodynamic and sequence scores. 3. We develop and incorporate in our model a novel metric called seed enrichment that captures all patterns of seed matches, including multiple mismatches, GU wobbles (sequence-based imperfections), and long bulges (architectural imperfections), in forming the miRNA-mRNA heteroduplex. By doing so, we are able to adopt a unified approach toward modeling canonical and non-canonical heteroduplexes. This creates a numerical feature that makes it easier for our ML classifier (and other ML-based approaches) to use this feature for classification. We also demonstrate that a whole gamut of non-canonical seed matches, involving bulges on the mRNA, are enriched in the set of positive miRNA-mRNA interactions, seen in both human and mouse-derived data. In fact, the proportion of non-canonical matches is higher than that of canonical matches. This category of matches had been missed in much of prior work, e.g., [32, 33]. Importance of seed and seedless matches Early studies on miRNA target recognition revealed near-perfect (contiguous) and conserved, Watson-Crick complementarity at the 5’ miRNA end, which was called the “seed region”. The seed is a 6–8 nt substring within the first 8 nucleotides, starting from the 5’ miRNA end. Typically, positions 2–7 from the 5’ end are considered to be the primary (canonical) determinant of target specificity [34– 37]. However, given the large number of random occurrences of any given hexamer in 3’ UTRs, a canonical “seed” match by itself is a poor predictor of miRNA-based regulation [38]. To complicate matters, non-canonical interactions involving “seedless sites”, where the interactions are not nucleated by perfectly complementary miRNA seed regions and yet effectively downregulate gene expression, have been described [39–45]. Popular sequence alignment tools such as BLAST cannot align short sequences with specific bulges or mismatch configurations [46]. Taken together, computational methods for miRNA target prediction have traditionally focused on canonical (seedbased) matches. Along the same lines, interactions with the 3’ UTRmRNA target region have been primarily modeled, as opposed to the 5’ UTR, or CDS, or non-coding mRNA regions. In our work, we remove these two restrictions and find seedless matches (in addition to the seed matches) throughout the gene regions3. Related work Among non-canonical prediction methods, mirSVR [26] allows for a single GU wobble or a mismatch in the 6mer seed region. For encoding the seed match pattern, mirSVR uses an 8-bit long vector, with “1” representing a match and “0” representing a mismatch and then uses the bit-vector as a feature in their Support Vector Regression (SVR) model. Recent methods have expanded the target search to other genic regions, such as, to the 5’ UTR and coding sequence (CDS) [27, 47]. In this bracket, Liu et al. generate predictions for sites involving non-canonical (seedless) matches. However, they do not take into consideration the type of non-canonicality for the examined seedless sites. Instead, they use thermodynamic and mRNA sequence features (e.g., local AU content) to generate predictions for the seedless sites. In doing so, they miss out on potential signal from the non-canonical seedless match patterns that our findings indicate as enriched in the identified functional miRNA-mRNA interactions. One possible reason for this, as pointed out by Xu et al. [48], is the difficulty in incorporating the large numbers of possible patterns of insertions and deletions in the mRNA seed-matched region for different non-canonical seedless match patterns. Computational methods have also exclusively relied on thermodynamic features, such as the stability of the miRNA-mRNA heteroduplex and the accessibility of the mRNA target region to identify functional miRNA binding sites. For example, Xu et al. [48] only use binding energy and accessibility to predict functional miRNA target sites. Another method, MIRZA [28] develops a rigorous biophysical model via parameterizing the alignment between a miRNA and an mRNA segment, interpreted as the binding energy between the two, and optimized using CLIPseq data. While MIRZA uses a novel model to incorporate canonical and non-canonical matches in a unifiedmanner, it does not take into account secondary mRNA structures (the spatial configuration) in developing their energy model—mRNA secondary structures can potentially limit the target site accessibility to the docking miRNA-RISC complex and therefore plays an important role in miRNA target recognition [32]. Further, these approaches compute various thermodynamic scores only at the target site region to summarize the thermodynamics of the miRNAmRNA interaction. For example, Xu et al. [48] observe that a certain normalized measure of site accessibility, has a characteristic pattern around the target site region. Yet, they do not exploit this observation in their model. In contrast, while Liu et al. [27] compute site accessibility in the target region’s vicinity in discrete chunks of 5, 10, 15, 20, 25, and 30 nt around the target site region, they report only the accessibility computed at the target site region as an important predictor of functional miRNA targets, failing to capture the mRNA secondary structures around the target site, which define its structural accessibility. Identifying this, prompted us to characterize binding energy and accessibility as curves to model the spatial profile of the miRNA-mRNA interaction. Ghoshal et al. BMC Genomics (2015) 16:999 Page 4 of 21 Results and Discussion This Section is segmented into four Sub-Sections. In the first, titled “Approach” we describe our overall solution approach. Next, in the Sub-Section titled “Results”, we present our experimental results and draw our observations from them. Third, we compare our performance visà-vis competition. Finally, we discuss some salient points in developing our model and present potential threats to the validity of our approach. Approach While CLIP-seq datasets identify short mRNA regions that are functional AGO-mRNA interaction sites [18, 49, 50], additional bioinformatic analysis is needed to identify the miRNA-mRNA binding sites. In order to identify miRNAs that might target those AGO-crosslinked regions, we followed the same general approach as previous methods [26, 27, 32, 47]. The idea is to first generate a candidate set of mRNA binding sites for a list of mRNAs and miRNAs by enforcing a minimum threshold on the alignment score 4 or the minimum free energy of hybridization ( G) of the miRNA and mRNA and/or using seed-match constraints. In the next step, different methods use varied approaches to identify true miRNA target sites within the generated candidate set of miRNAmRNA interactions. Previous works have used various criteria to generate the candidate set of mRNA target sites. For example, in mirSVR [26], the authors use the miRanda algorithm [51] to generate the initial candidate set. The miRanda algorithm computes an optimal local alignment of the miRNA with an mRNA sequence, by using various parameters for the overall alignment score, gap opening, and gap extension penalties. The authors generate candidate sites involving canonical seed matches, which they define as “sites that contain minimally a 6-mer perfect match, spanning miRNA positions 2 to 7”, and non-canonical seed matches. However, for the latter, they only allow a single G:U wobble or a single mismatch in the seed region. In [27], Liu et al. use two criteria for generating the candidate set. First, they use the RNAhybrid program [52] to generate candidate sites by enforcing a threshold of –15 kcal/mol on the thermodynamic binding energy ( G). Second, they constrain the seed match alignment, without constraining the binding energy for the match, to belong to one of the five seed classes of miRNA seeds, as defined in [4]. It is easy to see that by starting out with a more restricted set of candidate target sites, a method can achieve a higher true-positive rate for identifying positive target sites within this restrictive set. However, this would be at the cost of missing out on a large number of positive target sites that are not present in the candidate set in the first place. For our method, we select the least restrictive filter to have the most expansive superset for the initial selection of possible target locations genome-wide. Specifically, in our case we have the thermodynamic cut-off of –15 kcal/mol, and then, to generate seed-sites, we constrain the seed match to be at least a 6-mer without using any additional constraints. The cut-off value of –15 kcal/mol was the least restrictive among previous work that consider thermodynamic binding [27, 47]. If a target region meets either of these two criteria, then we include it in the candidate set. Thus, we challenge our model by coming up with the most expansive set of potential target sites. This shows up quantitatively in Table 1, where we see that the dataset that we evaluate on is the largest among all prior work. Algorithm 1 describes our process of generating the candidate set. Line 3 of the algorithm extracts all the candidate target regions for an miRNA-mRNA pair for which the binding energy ( G) is less than –15 kcal/mol. Line 4 of the algorithm extracts all target sites which have at least a 6-mer seed match. Line 5 removes from the entire set of target sites (extracted in step 3), those target sites that also have a seed match. Finally, the algorithm returns a set of seed and seedless target sites for each miRNA-mRNA pair. Algorithm 1 Candidate set generation for miRNA target prediction Input: mRNA listM, miRNA listN Output: Candidate target locations, involving canonical seed matches Os and non-canonical seed matches O, for all mRNA miRNA pairsM×N . 1: formRNAm ∈ M do 2: formiRNA n ∈ N do 3: Omn = Omn ∪ {t : G(m[ t] , n) ≤ −15kcal/mol} t is a target site inm as computed by RNAhybrid [52] 4: Os mn = Os mn ∪ {t : 6 ≤ ∣∣seedMatch(m[ t] , n)∣∣ ≤ 8} ∣∣seedMatch(m[ t] , n)∣∣ is the length of the seed match 5: Omn = Omn \Os mn 6: end for
منابع مشابه
Comparing MicroRNA Target Gene Predictions Related to Alzheimer's Disease Using Online Bioinformatics Tools
Introduction: The prediction of microRNAs related to target genes using bioinformatics tools saves time and costs of the experimental analyses. In the present study, the prediction of microRNA target genes relevant to Alzheimer’s Diseases (AD) were compared with the experimentally reported data using different bioinformatics tools. Method: A total of 41 microRNAs associated with 21 essential ge...
متن کاملComparing MicroRNA Target Gene Predictions Related to Alzheimer's Disease Using Online Bioinformatics Tools
Introduction: The prediction of microRNAs related to target genes using bioinformatics tools saves time and costs of the experimental analyses. In the present study, the prediction of microRNA target genes relevant to Alzheimer’s Diseases (AD) were compared with the experimentally reported data using different bioinformatics tools. Method: A total of 41 microRNAs associated with 21 essential ge...
متن کاملWeighted sequence motifs as an improved seeding step in microRNA target prediction algorithms.
We present a new microRNA target prediction algorithm called TargetBoost, and show that the algorithm is stable and identifies more true targets than do existing algorithms. TargetBoost uses machine learning on a set of validated microRNA targets in lower organisms to create weighted sequence motifs that capture the binding characteristics between microRNAs and their targets. Existing algorithm...
متن کاملNew support vector machine-based method for microRNA target prediction.
MicroRNA (miRNA) plays important roles in cell differentiation, proliferation, growth, mobility, and apoptosis. An accurate list of precise target genes is necessary in order to fully understand the importance of miRNAs in animal development and disease. Several computational methods have been proposed for miRNA target-gene identification. However, these methods still have limitations with resp...
متن کاملA Probabilistic Method for Prediction of microRNA-target Interactions
Elucidation of microRNA activity is a crucial step in understanding gene regulation. One key problem in this effort is how to model the pairwise interaction of microRNAs with their targets. As this interaction is strongly mediated by their sequences, it is desired to set up a probabilistic model to explain the binding between a microRNA sequence and the sequence of a putative target. To this en...
متن کاملNaïve Bayes for microRNA target predictions - machine learning for microRNA targets
MOTIVATION Most computational methodologies for miRNA:mRNA target gene prediction use the seed segment of the miRNA and require cross-species sequence conservation in this region of the mRNA target. Methods that do not rely on conservation generate numbers of predictions, which are too large to validate. We describe a target prediction method (NBmiRTar) that does not require sequence conservati...
متن کامل